perf(K2.5): optimize small kernels in EAGEL3 drafter loop #142

Merged
LorrinWWW merged 11 commits into lightseekorg:main from syuoni:opt-eagle3-draft-part2 on May 14, 2026

Conversation

@syuoni
Member

@syuoni syuoni commented May 14, 2026

Summary

Per-call overhead reductions in the EAGLE3 drafter loop and surrounding metadata prep.

Changes

  • compute_out_cache_loc uniform variant — added compute_out_cache_loc_uniform
    for the drafter's multi-step decode where every request has input_length=1. Skips
    the per-call torch.cumsum + torch.full host-side work and the kernel's GMEM
    reads of input_lengths_ptr/cumsum_lengths_ptr (Triton specializes on
    None-pointer at JIT time).
  • req_pool_indices_buf is now int64 — eliminates ~12 implicit int32→int64
    unrolled_elementwise cast kernels (~1.6 µs each) along the per-iteration
    metadata prep path. Pairs with switching the valid_cache_lengths[idx] fancy
    index to valid_cache_lengths.index_select(0, idx) so int32 indices are still
    accepted natively where they remain (e.g. in index_add_).
  • Eagle.draft() cleanup — fused cache_start + 1 into
    torch.add(..., out=draft_seq_lens); replaced the post-draft() torch.cat
    with direct writes into a shared next_tokens[bs, spec_num_steps+1] buffer;
    skip the last-iter positions.add_(1) / draft_seq_lens.add_(1) since they're
    not consumed; removed the dead logits.shape[0] != bs fallback branch.
  • Persistent drafter buffers: draft_seq_lens_buf, draft_out_cache_loc_buf,
    draft_input_lengths_buf, and last_index_offsets_buf are hoisted to
    Eagle.__init__ to avoid per-call alloc + init. The last one
    (= torch.arange(max_bs) * spec_num_tokens - 1, int64) replaces two
    per-call torch.arange(bs, ...) * spec_num_tokens patterns — one in the
    drafter's last-verified-id selection, one in LogitsProcessor's padded-static-len
    last-token-select. The precomputed buffer is sliced to [:bs] and plumbed via
    ForwardContext.last_index_offsets so LogitsProcessor can skip the
    arange + mul + sub triplet. Net: pre-step-0 last-token-select drops from
    6 kernels (arange + mul + sub + 2 gathers) to 3 (1 add + 2 gathers).
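The uniform compute_out_cache_loc variant described above can be illustrated with a minimal pure-Python sketch. It assumes that the general routine scatters each request's new cache slots at an offset given by the exclusive prefix sum of input_lengths. That semantics, the function names, and the req_to_token table layout are all hypothetical stand-ins; the real code is a Triton kernel whose signature is not shown in this PR description. When every input length is 1, the prefix sum is just the request index, so it never needs to be computed or read.

```python
from itertools import accumulate


def out_cache_loc_general(req_to_token, req_pool_indices, cache_starts, input_lengths):
    """Hypothetical general path: needs a prefix sum over input_lengths."""
    ends = list(accumulate(input_lengths))                      # inclusive prefix sum
    starts = [end - n for end, n in zip(ends, input_lengths)]  # exclusive prefix sum
    out = [0] * (ends[-1] if ends else 0)
    for pool_idx, cache_start, n, start in zip(
        req_pool_indices, cache_starts, input_lengths, starts
    ):
        for j in range(n):
            out[start + j] = req_to_token[pool_idx][cache_start + j]
    return out


def out_cache_loc_uniform(req_to_token, req_pool_indices, cache_starts):
    """Hypothetical uniform path: input_length == 1 for every request,
    so request i writes exactly one slot at output position i."""
    return [req_to_token[p][c] for p, c in zip(req_pool_indices, cache_starts)]


# Toy table: row = request pool slot, column = token position, value = cache location.
req_to_token = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
]
req_pool_indices = [2, 0]
cache_starts = [1, 3]

general = out_cache_loc_general(req_to_token, req_pool_indices, cache_starts, [1, 1])
uniform = out_cache_loc_uniform(req_to_token, req_pool_indices, cache_starts)
```

Both paths produce the same locations for length-1 inputs; the uniform one simply never touches the lengths or their prefix sum.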
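The arithmetic behind last_index_offsets_buf can be sketched in plain Python, with lists standing in for tensors. The sketch assumes the per-call pattern computes a flattened index of the form i * spec_num_tokens + n_i - 1, where n_i is a per-request count such as the number of accepted tokens. The accept_lens name and that interpretation are assumptions for illustration, not taken from the PR. With real tensors the [:bs] slice is a view rather than a copy, so only the single add remains per call.

```python
def build_last_index_offsets(max_bs, spec_num_tokens):
    """Built once (e.g. at init): arange(max_bs) * spec_num_tokens - 1."""
    return [i * spec_num_tokens - 1 for i in range(max_bs)]


def last_indices_per_call(accept_lens, spec_num_tokens):
    """Per-call pattern: arange + mul + sub are redone on every step."""
    bs = len(accept_lens)
    return [i * spec_num_tokens + accept_lens[i] - 1 for i in range(bs)]


def last_indices_precomputed(accept_lens, offsets_buf):
    """Precomputed pattern: slice the persistent buffer, then one add."""
    bs = len(accept_lens)
    return [off + n for off, n in zip(offsets_buf[:bs], accept_lens)]


offsets_buf = build_last_index_offsets(max_bs=8, spec_num_tokens=4)
accept_lens = [1, 4, 2]  # hypothetical per-request counts, each in 1..spec_num_tokens

per_call = last_indices_per_call(accept_lens, spec_num_tokens=4)
precomputed = last_indices_precomputed(accept_lens, offsets_buf)
```

The two results are identical, which is why the buffer can be shared by the drafter and LogitsProcessor as long as bs never exceeds max_bs.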

syuoni added 10 commits May 14, 2026 13:28
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
@syuoni syuoni force-pushed the opt-eagle3-draft-part2 branch from fc15f9a to 80fd19b Compare May 14, 2026 13:31
@syuoni syuoni changed the title from "[WIP] perf(K2.5): optimize small kernels in EAGEL3 drafter loop" to "perf(K2.5): optimize small kernels in EAGEL3 drafter loop" May 14, 2026
@syuoni syuoni marked this pull request as ready for review May 14, 2026 13:33
@syuoni syuoni requested a review from a team as a code owner May 14, 2026 13:33
Contributor

@LorrinWWW LorrinWWW left a comment

lgtm

@LorrinWWW LorrinWWW merged commit 82b4e49 into lightseekorg:main May 14, 2026
52 of 58 checks passed